{⊂C;<N;αVISION THEORY.;λ30;P68;I425,0;JCFA} SECTION 6.
{JCFD}                   COMPUTER VISION THEORY.
{λ10;W250;JAFA}
	6.0	Introduction to Computer Vision Theory.
	6.1	A Geometric Feedback Vision System.
	6.2	Vision Tasks.
	6.3	Vision System Design Arguments.
	6.4	Mobile Robot Vision.
	6.5	Summary and Related Vision Work.

{λ30;W0;I900,0;JUFA}
⊂6.0	Introduction to Computer Vision Theory.⊃

	Computer vision concerns programming a computer  to do a task
that  demands the  use of  an image  forming light  sensor such  as a
television camera.  The theory I intend to elaborate is  that general
3-D  vision is  a continuous  process of  keeping an  internal visual
simulator  in sync with  perceived images of the  external reality, so
that vision tasks  can be done more  by reference to the  simulator's
model  and  less  by  reference  to  the original  images.  The  word
<theory>, as used here,  means simply a set of statements  presenting
a systematic view of  a subject; specifically, I wish  to exclude the
connotation that  the theory is a natural  theory of vision. Perhaps
there can be  such a thing  as an  <artificial theory> which  extends
from the philosophy thru the design of an artifact.

⊂6.1	A Geometric Feedback Vision System.⊃

	Vision systems mediate between images and world models; these
two  extremes  of a  vision system  are  called, in  the  jargon, the
<bottom> and  the <top> respectively.   In  what follows,   the  word
<image> will be used  to refer to the notion of  a 2-D data structure
representing  a picture; a  picture being a rectangle  taken from the
pattern  of  light  formed  by  a  thin  lens  on   the  nearly  flat
photoelectric surface of a  television camera's vidicon. On the other
hand, a  <world model>  is  a data  structure  which is  supposed  to
represent the physical world for the purposes of a task processor. In
particular,  the  main  point of  this  thesis  concerns isolating  a
portion of the world model (called the 3-D geometric world model) and
placing it below most of the other entities that a task processor has
to deal with.  The vision hierarchy, so formed,  is illustrated in box 6.1.
{|λ10;JA}
BOX 6.1 {JC} VISION SYSTEM HIERARCHY.

{JC} Task Processor
{JC} |
{JC} Task World Model
		 The  Top  → {JC} |
{JC} 3-D Geometric Model
{JC} |
		 The Bottom → {JC} 2-D Images
{|λ30;JU}
	Between the top and  the bottom, between images and  the task
world model,  a general vision system has three distinguishable modes
of operation: recognition,  verification and description. Recognition
vision can be characterized as bottom up: what is in the picture is
determined by extracting a set of features from the image and by
classifying them with respect to prejudices which must be taught.
Verification vision is top  down or model driven vision, and involves
predicting an image followed by  comparing the predicted image and  a
perceived  image for  differences  which  are  expected but  not  yet
measured. Descriptive vision is bottom  up or data  driven vision and
involves converting  the image into  a representation  that makes  it
possible (or easier) to do the desired  vision task.  I would like to
call  this  third  kind  of  vision  "revelation  vision"  at  times,
although the  phrase "descriptive vision"  is the  term used by  most
members of the computer vision community.
{|λ10;JU;FA}
Box 6.2 {JC} THREE BASIC MODES OF VISION.

	1. Recognition Vision - Feature Classification. (bottom up into a prejudiced top).
	2. Verification Vision - Model Driven Vision. (nearly pure top down vision).
	3. Descriptive Vision - Data Driven Vision. (nearly pure bottom up vision).
{|λ30;JU}
	There are now enough concepts to outline a feedback system.
By placing a 3-D geometric model between top and bottom, recognition
vision can be done by mapping 3-D (rather than 2-D) features into the
task world model, with descriptive vision and verification vision
linking the 2-D and 3-D models in a relatively dumb, mechanical
fashion.  Previous attempts to use recognition vision to bridge
directly the gap between 2-D images (of 3-D objects) and the task
world model have been frustrated because the characteristic 2-D image
features of a 3-D object are very dependent on the 3-D physical
processes of occultation, rotation and illumination.  It is these
processes that will have to be modeled and understood before the
features relevant to the task processor can be deduced from the
perceived images.  The arrangement of these elements is diagrammed
below.{|λ10;JA}
Box 6.3 {JC} BASIC FEEDBACK VISION SYSTEM DESIGN.

{JC} Task World Model
{JC} ↑
{JC} RECOGNITION
{JC} ↑
{JC} 3-D geometric model
{JC} ↑            ↓
{JC} DESCRIPTION        VERIFICATION
{JC} ↑            ↓
{JC} 2-D images
{|λ30;JU}
	The lower part of the above
diagram is the feedback loop of the 3-D geometric
vision system.  Depending on circumstances, the vision system may
run almost entirely top-down (verification vision) or
bottom-up (revelation vision).  Verification vision is all that is
required in a well known, predictable environment; whereas revelation
vision is required in a brand new (tabula rasa) or rapidly changing
environment.  Thus revelation and verification form a loop, bottom-up
and top-down.  First, there is revelation, which unprejudicially builds
a 3-D model; and second, the model is verified by testing image
features predicted from the model.  This loop-like structure
has been noted before by others; it is a form of what Tenenbaum (71)
called <accommodation> and a form of what Falk (69) called
<heuristic vision>; however, I will go along with what I think is the
current majority of vision workers who call it <feedback vision>.
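
	As an illustration only, this loop can be caricatured in a few
lines of Python; the routine names, the toy 2-D world of landmark
points, and the numbers below are hypothetical and are not routines
of the actual system.  Matched features verify and correct the camera
locus; unmatched features are revealed and added to the model.

import math

def predict(model, locus):
    # top down: predicted feature positions, given the model and a camera locus
    return [(x + locus[0], y + locus[1]) for (x, y) in model]

def match(predicted, perceived, tol=0.5):
    pairs, unmatched = [], []
    for q in perceived:
        near = min(predicted, key=lambda p: math.dist(p, q), default=None)
        if near is not None and math.dist(near, q) < tol:
            pairs.append((near, q))
        else:
            unmatched.append(q)
    return pairs, unmatched

def solve_locus(locus, pairs):
    # verification: correct the locus by the mean discrepancy of matched features
    dx = sum(q[0] - p[0] for p, q in pairs) / len(pairs)
    dy = sum(q[1] - p[1] for p, q in pairs) / len(pairs)
    return (locus[0] + dx, locus[1] + dy)

def reveal(model, locus, unmatched):
    # revelation: place unexplained perceived features back into the world model
    return model + [(x - locus[0], y - locus[1]) for (x, y) in unmatched]

model, locus = [(0.0, 0.0), (1.0, 0.0)], (5.0, 0.0)    # dead reckoned locus estimate
perceived = [(5.2, 0.0), (6.2, 0.0), (7.2, 0.0)]       # the true locus is (5.2, 0.0)
pairs, new = match(predict(model, locus), perceived)
locus = solve_locus(locus, pairs)                      # corrected to (5.2, 0.0)
model = reveal(model, locus, new)                      # a third landmark appears at (2.0, 0.0)
print(locus, model)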

	Completing  the   design,     the  images   and  worlds   are
constructed, manipulated and compared by a variety of processors, the
topmost of which is the  task processor. Since the task processor  is
expected to  vary with the application,  it would be expedient  if it
could  be isolated as a  user  program that calls on utility routines
of  an appropriate  vision  sub-system.  Immediately below  the  task
processor  are the  3-D  recognition routines  and  the 3-D  modeling
routines.  The modeling routines underlie nearly everything because
they are used to create, alter and access the models.{
|;λ10;JAFA}
Box 6.4	{JC} PROCESSORS OF A 3-D VISION SYSTEM.
{↓}	
	0. The task processor.
	1. 3-D recognition.
	2. 3-D modeling  routines.
	3. Reality simulator.
{↑;W560;}
	4. Image analyser.
	5. Image synthesizer.
	6. Locus solvers.
	7. Comparators: 2D and 3D.
{|;λ30;JUFA}
	The remaining processors include the  reality simulator which
does mechanics for modeling motion, collision and gravity.
Also there  are image  analyzers,   which  do image  enhancement  and
conversions  such as  converting  video rasters  into line  drawings.
There  is an  image synthesizer, which  does hidden  line and surface
elimination, for verification by comparing synthetic  images from the
model  with perceived  images of  reality. There  are three  kinds of
locus solvers that compute numerical descriptions for cameras,  light
sources and physical objects.  Finally, there are of course a large
number of (at least ten) different compare processors for confirming or
denying correspondences among entities in each of the different kinds
of images and 3-D models.
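
	To fix ideas, a minimal sketch of the image synthesizer's
innermost step is given below (in Python, with hypothetical names and
toy numbers): perspective projection of 3-D model points into a
predicted image, with a tiny depth buffer standing in for hidden
surface elimination.  The actual synthesizer works on full models and
does hidden line and surface elimination; this point-wise version is
only a caricature.

def synthesize(points, focal=100.0, size=64):
    depth = {}                                   # (u, v) -> nearest z seen so far
    for (x, y, z) in points:
        if z <= 0.0:
            continue                             # behind the camera
        u = int(size / 2 + focal * x / z)        # perspective projection
        v = int(size / 2 + focal * y / z)
        if 0 <= u < size and 0 <= v < size:
            if (u, v) not in depth or z < depth[(u, v)]:
                depth[(u, v)] = z                # the nearer point hides the farther one
    return depth

# two model points along the same ray; only the nearer one survives
print(synthesize([(0.2, 0.0, 2.0), (0.4, 0.0, 4.0)]))   # {(42, 32): 2.0}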

⊂6.2	Vision Tasks.⊃

	The 3-D  vision research problem  being discussed is  that of
finding  out how to  write programs that  can see in  the real world.
Related vision problems  include: modeling  human
perception,  solving visual  puzzles (non-real world), and developing
advanced automation techniques (ad hoc vision).  In order to approach
the problem, specific programming tasks are proposed and solutions
are sought; however, a programming task is different from a research
problem because many vision
tasks can be done without vision.  The vision solution to be found
should  be  able  to  deal with  real  images,    should include  the
continuity of the  visual process in  time and  space, and should  be
more general  purpose and less ad hoc.    These  three  requirements
(reality,   continuity, and generality) will be  developed by surveying
six examples of computer vision tasks.{Q}
{|;λ10;JAFA}
BOX 6.5{JC}	SIX EXAMPLES OF COMPUTER VISION TASKS.
{↓}
<Cart Related Tasks>.
	1. The Chauffeur Task.
	2. The Explorer Task.
	3. The Soldier Task.
{↑;W650;}
<Table Top Related Tasks>.
	4. Turntable Task.
	5. The Blocks Task.
	6. Machine Assembly Tasks.
{|;λ30;JUFA}
	First, there  is the  robot chauffeur  task.   In 1969,   John
McCarthy asked  me to consider the vision  requirements of a computer
controlled car such as he depicted in an unpublished essay.  The idea
is that a user of such  an automatic car would request a destination;
the  robot would select a  route from an  internally stored road map;
and it would then proceed to its destination using  visual data.  The
problem  involves  representing the  road  map  in  the computer  and
establishing the correspondence between the map and the appearance of
the road  as  the automatic  chauffeur drives  the  vehicle along  the
selected route.   Lacking a computer controlled car,  the problem was
abstracted to that of tracing a route along the driveways and parking
lots that  surround the Stanford  A.I. Laboratory using  a television
camera  and transmitter mounted on a  radio controlled electric cart.
The robot chauffeur task could  be solved by non-visual means  such as
by railroad-like guidance or by inertial guidance; to preserve the
vision aspect  of the  problem,   no particular  artifacts should  be
required along a route (landmarks must be found, not placed); and the
extent of inertial dead reckoning should be noted.

	Second,  there is the task of a robot explorer.  In (McCarthy
1964) there is a description of a robot for exploring Mars. The robot
explorer was required  to run for long periods of  time without human
intervention because the signal transmission time to Mars is as great
as twenty minutes and because the 24.6 hour Martian day would place
the  vehicle out of  Earth sight for  twelve hours at a time.   (This
latter difficulty could be avoided at the expense of having a set  of
communication relay satellites in orbit around Mars.) The task of the
explorer  would be to drive  around mapping the  surface, looking for
interesting features,  and doing various experiments.  To be prudent,
a Mars explorer  should be able to navigate without  vision; this can
be  done  by driving  slowly  and by  using a  tactile  collision and
crevasse detector.  If the television system fails,  the core samples
and so on  can still be collected at  different Martian sites without
unusual risk to the vehicle due to visual blindness.

	The third vision  task is that  of the robot soldier,   tank,
sentry, pilot or  policeman.  The problem has several forms which are
quite similar to the chauffeur  and the explorer with the  additional
goal of doing something to coerce  an opponent.  Although this vision
task has  not yet been explicitly attempted at Stanford,  to the best
of my knowledge, the reader should be warned that a thorough solution
to any of the other  tasks almost assures the Orwellian technology to
solve this one.

	Fourth, the turntable task is to construct a 3-D model  from
a sequence of 2-D  television images taken of an object  rotated on a
turntable.   The turntable task was  selected as a simplification of
the explorer  task and  is an  example of  a  nearly pure  descriptive
vision task.

	Fifth, the classic blocks vision  task consists of two parts:
first  convert a  video image into  a line  drawing; second,   make a
selection from a  set of predefined  prototype models of blocks  that
accounts  for the line  drawing.   In my opinion,   this  vision task
emphasizes three pitfalls:  single image vision,   line drawings  and
blocks. The greatest pitfall, in the usual blocks vision task, is the
presumption  that a  single  image is  to be  solved;  thus diverting
attention  away  from  the   two  most  important  depth   perception
mechanisms which are motion parallax  and stereo parallax. The second
pitfall is that the usual notion of a perspective line drawing is not
a natural intermediate state; but is rather a  very sophisticated and
platonic geometric  idea. The perfect line  drawing lacks photometric
information; even a line drawing  with perfect shadow lines  included
will not resemble anything  that can readily be gotten  by processing
real television pictures.  Curiously, the lack of success in deriving
line drawings  from real  television images  of real  blocks has  not
dampened  interest in  solving the  second part  of the  problem. The
perfect line drawing puzzle was first worked on by Guzman (68) and
extended to perfect shadows by Waltz (72); nevertheless, enough remains so
that the puzzle will persist on its own merits, without being
closely relevant to real world computer vision.  Even assuming that
imperfect line drawings are given, the blocks themselves have
led such researchers as Falk (69) and Grape (73) to concentrate on vertex/edge
classification schemes which have not been extended beyond the blocks
domain.  The blocks task could be rehabilitated by concentrating on
photometric modeling and the use of multiple images for depth
perception.
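
	As a concrete instance of the stereo parallax appealed to
above: for two camera stations a baseline B apart, with focal length
f, a point whose two images differ by a disparity d lies at a depth
of roughly Z = f*B/d.  The few lines of Python below, with purely
illustrative numbers, are only a reminder of that relation.

def stereo_depth(focal, baseline, disparity):
    # Z = f * B / d, all lengths in the same units
    return focal * baseline / disparity

# e.g. a 25 mm lens, camera stations 100 mm apart, a measured disparity of 0.5 mm
print(stereo_depth(25.0, 100.0, 0.5))            # 5000.0 mm: the point is about 5 meters away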

	Sixth,  the Stanford  Artificial  Intelligence Laboratory  has
recently  (1974) begun  work on a  National Science  Foundation Grant
supporting research in  automatic machine  assembly.  In  particular,
effort  will  be  directed  to  developing  techniques  that  can  be
demonstrated  by  automatically  assembling  a  chain  saw  gasoline
engine.  Two vision questions in such a machine assembly task are:
where is the part, and where is the hole?  These questions will be
initially handled by composing ad hoc part and hole detectors for
each vision step required for the assembly.

	The point of this  task survey was to illustrate what  is and is
not a  task requiring real 3-D vision; and  to point out that caution
has to be taken  to preserve the vision aspects  of a given task.  In
the usual course of vision projects, a single task or a single tool
unfortunately dominates the research; my work is no exception: the
one tool is 3-D modeling, and the task that dominated the formative
stages of the research is that of the robot chauffeured cart.  A
better understanding of the ultimate nature of computer vision can be
obtained by keeping the several tasks and the several tools in mind.

⊂6.3	Vision System Design Arguments.⊃

	The physical information most directly  relevant to vision is
the location,  extent and light scattering properties of solid opaque
objects; the location,   orientation  and projection of  the camera  that
takes the  pictures; and the  location and  nature of the  light that
illuminates  the world.    The transformation  rules of  the everyday
world that  a  programmer  may assume,  a  priori,  are the  laws  of
physics.  The arguments against  geometric modeling divide
into two categories: the reasonable and the intuitive.
The reasonable arguments attack 3-D geometric modeling by
comparing it to other modeling alternatives, some of which are
listed in Box 6.6.  Actually, the domains
of  efficiency of  the  possible  kinds  of models  do  not
greatly overlap;  and an artificial intellect  will have some portion of
each  kind.  Nevertheless, I  feel  that  3-D  geometric  modeling  is
superior for  the task at  hand, and that  the other models  are less
relevant to vision.{Q}
{|;λ10;JAFA}
BOX 6.6{JVJC} Alternatives to 3-D Geometric Modeling in a Vision System.{I∂20,0;}
		1. Image memory, with only the camera model in 3-D.
		2. Statistical world model, e.g. Duda & Hart.
		3. Procedural Knowledge, e.g. Hewitt & Winograd.
		4. Semantic knowledge, e.g. Wilks & Schank.
		5. Formal Logic models, e.g. McCarthy & Hayes.
		6. Syntactic models.
{|;λ30;JUFA}
	Perhaps the best alternative  to a 3-D geometric model  is to
have  a library  of little  2-D images  describing the  appearance of
various 3-D loci from given directions.  The advantage would  be that
a sophisticated image predictor would not be required; on the other
hand, the image library is potentially quite large, and even with
a huge data base new views and lighting of familiar objects and
scenes cannot be anticipated.  A second alternative is the statistical
world model used in the pattern recognition paradigm.
Such modeling might be added to the geometric model; however, on its own
the statistical abstraction of world features in the presence
of occultation,  rotation and illumination seems as hopeless
as the abstraction of a man's personality from the 
pattern of tea leaves in his cup.

	Procedural knowledge models  represent the world in  terms of
routines (or actors) which either know or can compute the answer to a
question about the world.  Semantic models represent the world in terms
of a data structure of conceptual statements; and formal logic models
represent the  world in terms of first order predicate calculus or in
terms of a  situation calculus. The  procedural, semantic and  formal
logic world  models are all general enough  to represent a
vision model and in a theoretical sense they are merely other notations
for  3-D  geometric modeling.    However  in  practice,  these  three
modeling   regimes  are  not   efficient  holders   and  handlers  of
quantitative geometric  data; but are  rather intended  for a  higher
level  of  abstract reasoning.  Another  alleged  advantage of  these
higher  models  is  that they  can  represent  partial  knowledge and
uncertainty,  which  in  a  geometric  model  is   implicit, in  that
structures are missing or incomplete.  For example, McCarthy and
Feldman demand that when a robot has only seen the front of an office
desk it should be able to draw inferences from its model about the back
of the desk; I feel that this so-called advantage is not required by
the problem and that basic visual modeling is on a more agnostic
level.

	The syntactical  approach to  descriptive vision  is that  an
image is a sentence of a picture grammar and that consequently the
image description should be given in terms of a sequence of grammar
transformation rules.  Again, this paradigm is valid in principle but
impractical   for  real   images  of   3-D  objects   because  simple
replacement rules  cannot readily  express rotation,   perspective,
and photometric  transformations. On the other  hand, the syntactical
model has been used to describe perfect line drawings of 3-D objects
(Gips 74).

	The intuitive arguments  include the opinions  that geometric
modeling is too numerical, too exact, or too non-human to be relevant
for computer vision research. Against such intuitions, I wish to pose
two fallacies. First,  there is the natural mimicry  fallacy, which is
that it  is false to insist that a machine must mimic nature in order
to achieve  its  design  goals. Boeing  747's  are not  covered  with
feathers;  trucks do  not  have legs;  and computer  vision  need not
simulate human vision.   The advocates of  the uniqueness of  natural
intelligence  and perception  will  have to  come  up with  a  rather
unusual  uniqueness  proof to  establish  their conjecture.    In the
meantime, one  should be  open  minded about  the potential  forms  a
perceptive consciousness can take.

	Second,  there is  the self  introspection fallacy,  which is
that  it is false  to insist  that one's introspections  about how he
thinks and  sees are  direct observations  of thought  and sight.  By
introspection  some conclude that  the visual  models (even on  a low
level) are essentially qualitative rather than quantitative.  My belief
is that the vision processing of the  brain is quite quantitative and
only passes into qualities at a higher level of processing. In either
case, the exact details  of human visual processing are  inaccessible
to conscious self introspection.

	Although describing  the above two  fallacies might  soften a
person's  prejudice  against  numerical  geometric  modeling,    some
important argument or idea is missing that would be  convincing short
of the  final achievement of  computer vision. Contrariwise,   I have
not heard an argument that would change my prejudice in favor of such
models.  Nevertheless, beyond prejudice, my theory would be proved
wrong if a really powerful computer vision system is ever built
without using any geometric models worth speaking of, perhaps by
employing an elaborate stimulus-response paradigm.

⊂6.4	Mobile Robot Vision.⊃

	The elements  discussed so far  will now be  brought together
into a system design for performing mobile robot vision. The proposed
system is illustrated below in the block diagram in Box 6.7.  (The
diagram is called a mandala in that
a <mandala> is any circle-like system diagram.)  Although the robot
chauffeured cart was the main task theme for this research, I have
failed to date (August 1974) to achieve the hardware and software
required to drive the cart around the laboratory under its own
control.  Nevertheless, this necessarily theoretical cart system has
been of  considerable  use  in developing  the  visual  3-D  modeling
routines and theory, which are the subject of this thesis.
{|;JV;FA}
BOX 6.7{JC} CART VISION MANDALA.
{W300;λ4;F2}
 →→→→→→→→→→→→→→→→→→→ PERCEIVED →→→→→→ REALITY →→→→→→ PREDICTED →→→→
 ↑	               WORLD         SIMULATOR         WORLD      ↓
 ↑  								  ↓
 ↑								  ↓
 ↑                   PERCEIVED →→→→→→  CART →→→→→→→→ PREDICTED →→→↓
 ↑	            CAMERA LOCUS      DRIVER        CAMERA LOCUS  ↓
 ↑	                ↑		↓		   	  ↓
 ↑	                ↑		↓		   	  ↓
 ↑                      ↑	      THE CART	     PREDICTED→→→→↓
BODY                 CAMERA			     SUN LOCUS 	  ↓
LOCUS		     LOCUS				 	  ↓
SOLVER		     SOLVER				          ↓
 ↑			↑				          ↓
 ↑			↑			 	          ↓
REVEAL 	             VERIFY				       IMAGE  
COMPARE		     COMPARE				 SYNTHESIZER
 ↑   ↑	 	      ↑   ↑				          ↓
 ↑   ↑                ↑   ↑ 				          ↓
 ↑   ←←	PERCEIVED→→→→→↑   ↑←←←←←←←←←←←←←←←←←←←←	PREDICTED  ←←←←←←←↓
 ←←←←← MOSAIC IMAGE			      MOSAIC IMAGE        ↓
	   ↑					   ↑	          ↓
	   ↑					   ↑	          ↓
	   ↑					   ↑              ↓
	PERCEIVED			        PREDICTED         ↓
      CONTOUR IMAGE			      CONTOUR IMAGE       ↓
	   ↑					   ↑ 	          ↓
	   ↑					   ↑	          ↓
	   ↑					   ↑	          ↓
	PERCEIVED				PREDICTED ←←←←←←←←←
       VIDEO IMAGE			       VIDEO IMAGE
	   ↑
	   ↑
	   ↑
       TELEVISION
	 CAMERA

{|;λ30;JUFA}
	The   robot   chauffeur   task   involves   establishing   the
correspondence between an internal road map and the appearance of the
road in order to steer a vehicle along a predefined path. For a first
cut, the planned route  is assumed to be clear, and  the cart and the
sun  are assumed  to be the  only movable  things in  a static world.
Dealing with moving obstacles is a second problem; motion thru a
static world must be dealt with first.

	The cart  at the Stanford  Artificial Intelligence Laboratory
is intended for outdoor use and consists of a piece of plywood, four
bicycle wheels, six electric motors, two car batteries,  a television
camera,   a television transmitter, a box of  digital logic, a box of
relays,   and a  toy airplane  radio receiver.    (The vehicle  being
discussed is  not "Shaky",   which belongs  to the  Stanford Research
Institute's  Artificial Intelligence Group.  There  are two A.I. labs
near Stanford and  each has a  computer controlled vehicle.) The  six
possible cart actions are: run forwards,  run backwards, steer to the
left,  steer to the right, pan camera to the left,  pan camera to the
right.   Other than  the television  camera,   there is no  telemetry
concerning the state of the cart or its immediate environment.
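
	Since the television camera is the only source of feedback,
the predicted cart locus can only be advanced between pictures by
dead reckoning over these six commands.  A sketch of such an update
is given below (in Python); the state variables, step size and turn
increment are hypothetical rather than measured cart parameters, and
steering is caricatured as an instantaneous change of heading.

import math

def dead_reckon(locus, command, step=1.0, turn=math.radians(5)):
    x, y, heading, pan = locus
    if command == "forward":
        x, y = x + step * math.cos(heading), y + step * math.sin(heading)
    elif command == "backward":
        x, y = x - step * math.cos(heading), y - step * math.sin(heading)
    elif command == "steer left":
        heading += turn
    elif command == "steer right":
        heading -= turn
    elif command == "pan left":
        pan += turn
    elif command == "pan right":
        pan -= turn
    return (x, y, heading, pan)

locus = (0.0, 0.0, 0.0, 0.0)                     # x, y, heading, camera pan
for command in ["forward", "steer left", "forward"]:
    locus = dead_reckon(locus, command)
print(locus)                                     # about (2.00, 0.09, 0.09, 0.0)
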
{|;λ10;JAFA}
BOX 6.8 {JC} A POSSIBLE CART TASK SOLUTION.
	 	1. Predict (or retrieve) 2-D image features.
		2. Perceive (take) a television picture and convert into features.
		3. Compare (verify)  predicted and perceived features.
		4. Solve for camera locus.
		5. Servo the cart along its intended course.
{|;λ30;JUFA}
	The solution to the cart problem begins with the cart at a
known  starting position  with a  road map  of visual  landmarks with
known loci. That is,  the upper leftmost  two rectangles of the  cart
mandala  are initialized  so that  the perceived  cart locus  and the
perceived world correspond with  reality.  Flowing across  the top of
the mandala, the cart driver blindly moves the cart forward along
the desired route by dead reckoning (say the cart moves five feet and
stops) and the driver updates the predicted cart locus.  The  reality
simulator is  an identity in  this simple case  because the  world is
assumed static.  Next the image synthesizer uses the predicted world,
camera and sun to compute a predicted image containing  the landmark
features  expected to  be in  view.  Now, in  the lower  left of  the
mandala,  the cart's television camera takes  a perceived picture and
(flowing upwards) the picture  is converted into a form  suitable for
comparing and  matching with the  predicted image. Features  that are
both predicted  and perceived  and found  to match  are used  by  the
camera locus  solver to compute  a new  perceived camera locus  (from
which  the cart locus can  be deduced). Finally the  cart driver compares
the perceived and  the predicted cart locus  and corrects its  course
and moves the cart again, and so on.
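
	The course correcting step (step 5 of Box 6.8) can be
caricatured as follows (in Python, with a hypothetical waypoint,
tolerance and command names): the driver compares the bearing toward
the next point of the intended route with the heading implied by the
solved cart locus, and chooses a steering command.

import math

def servo(locus, waypoint, tol=math.radians(10)):
    x, y, heading = locus
    bearing = math.atan2(waypoint[1] - y, waypoint[0] - x)
    error = (bearing - heading + math.pi) % (2 * math.pi) - math.pi   # wrap to [-pi, pi)
    if error > tol:
        return "steer left"
    if error < -tol:
        return "steer right"
    return "forward"

# the solved locus says the cart has drifted to the right of its intended course
print(servo((5.2, -1.5, 0.0), (10.0, 0.0)))      # -> "steer left"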

	The remaining limb of the cart mandala is invoked in order to
turn the  chauffeur into an explorer.   Perceived images are compared
in time by  the reveal compare  and new features  are located by  the
body locus solver and placed into the world model. The generality and
feasibility  of such  a cart  system depends  almost entirely  on the
representation of the world and the representation of image features.
(The  more general,   the less  feasible). Four smaller  cart systems
might be possible using simpler 3-D models.

	A first system might consist of a road map, a road model, a
road model generator, a solar ephemeris, an image predictor, an
image comparator, a camera locus solver, and a course servo routine.
The roadways and nearby environs are entered into the computer.  In
fact, real roadways are constructed from a two dimensional (X,Y)
alignment map showing where the center of the road goes as a curve
composed of line segments and circular arcs; and from a two
dimensional (S,Z) elevation diagram, showing the height of the road
above sea level as a function of distance along the road in a
sequence of linear grades and vertical arcs which (not too
surprisingly) are nearly cubic splines.  A second version might be made
like the first except that the road model, road model generator, and
image predictor are replaced by a library of road images.  In this
system the robot vehicle is trained by being driven down the roads it
is supposed to follow.  A third system also might be made like the
first except that the road map is not initially given, and indeed the
road is no  longer presumed to  exist.  Part  of the problem  becomes
finding a road,  a road in  the sense of  a clear area;  this version
yields the cart explorer and if the clear area is found quite rapidly
and the  world is updated  quite frequently,  the explorer  can be  a
chauffeur that can handle obstacles and moving objects.
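
	A sketch of the first system's two part road representation is
given below (in Python, with a hypothetical sample road and routine
names): the (X,Y) alignment is a list of straight segments and
circular arcs, the (S,Z) elevation is a piecewise linear profile
(vertical arcs are omitted for brevity), and the road locus is found
by walking the alignment a given distance S.

import math

# alignment: ("line", length) or ("arc", length, curvature = 1/radius)
alignment = [("line", 50.0), ("arc", 31.4159, 1 / 20.0), ("line", 30.0)]
# elevation: (s, z) breakpoints joined by linear grades
elevation = [(0.0, 10.0), (50.0, 12.0), (111.4, 12.0)]

def locus_at(s):
    x, y, heading = 0.0, 0.0, 0.0
    for kind, length, *rest in alignment:
        d = min(s, length)
        if kind == "line":
            x, y = x + d * math.cos(heading), y + d * math.sin(heading)
        else:
            k = rest[0]                          # circular arc of curvature k
            x += (math.sin(heading + k * d) - math.sin(heading)) / k
            y += (math.cos(heading) - math.cos(heading + k * d)) / k
            heading += k * d
        s -= d
        if s <= 0.0:
            break
    return (x, y, heading)

def height_at(s):
    for (s0, z0), (s1, z1) in zip(elevation, elevation[1:]):
        if s <= s1:
            return z0 + (z1 - z0) * (s - s0) / (s1 - s0)
    return elevation[-1][1]

print(locus_at(60.0), height_at(60.0))           # ten units into the curve, at height 12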

⊂6.5	Summary and Related Vision Work.⊃

	To recapitulate, three vision system design requirements were
postulated: reality,  generality,  and continuity. These requirements
were illustrated  by discussing  a number  of  vision related  tasks.
Next, a vision  system was described as mediating  between 2-D images
and  a world model;  with the  world model being  further broken down
into a  3-D geometric  model and a  task world  model. Between  these
entities  three  basic  vision  modes were  identified:  recognition,
verification and revelation (description).  Finally, the general
purpose vision system was depicted as a quantitative and description
oriented feedback cycle which maintains a 3-D geometric model for the
sake of higher qualitative, symbolic, and recognition oriented task
processors.  Approaching the vision system in greater detail, the roles
of seven (or so) essential kinds of processors were explained: the
task processor, 3-D modeling routines, reality simulator, image
analyser,   image synthesizer, comparators,   and locus  solvers. The
processors and  data  types  were assembled  into  a  cart  chauffeur
system.

	Larry Roberts is  justly credited for doing the  seminal work
in  3-D Computer  Vision; although his  thesis (Roberts  63) appeared
over ten years ago, the subject has languished, dependent on and
overshadowed by the four areas called: Image Processing, Pattern
Recognition,   Computer  Graphics,     and  Artificial  Intelligence.
Outside the computer  sciences, workers in psychology,  neurology and
philosophy also seek a theory of vision.

	Image  Processing  involves  the  study  and  development  of
programs that enhance,  transform and compare 2-D images.  Nearly  all
image processing work can eventually be applied to computer vision in
various circumstances. A survey of this field can be found in an
article by  Rosenfeld(69).   Image Pattern  Recognition involves  two
steps: feature  extraction and classification.   A comprehensive text
about this field with respect to computer vision has been written by
(Duda and Hart 73).  Computer Graphics is the  inverse of descriptive
computer vision.  The problem of computer graphics is to synthesize
images from  three dimensional  models;  the problem  of  descriptive
computer vision is  to analyze images into three  dimensional models.
An introductory  text book about this field  would be that of (Newman
and Sproull 73). Finally, there is Artificial Intelligence,  which in
my opinion is an institution sheltering a heterogeneous group of
embryonic  computer subjects; the biggest of  the present day orphans
include: robotics,    natural language,    theorem proving,    speech
analysis, vision and planning.  A more narrow and relevant definition
of artificial intelligence is that it concerns the programming of the
robot task processor which sits above the vision system.

	The related vision work of specific individuals has already
been mentioned in context.  To summarize, the present vision work is
related to the early work of Roberts(63) and Sutherland; to recent
work at Stanford: Falk, Feldman and Paul(67), Tenenbaum(72),
Agin(72), Grape(73); to the work at MIT: Guzman, Horn, Waltz,
Krakauer; to the work at the University of Utah: Warnock, Watkins;
and to work at other places: SRI and JPL. Future progress in computer
vision will proceed in  step with better  computer hardware,   better
computer  graphics  software, and  better  world  modeling  software.
Further vision work at Stanford, which is related to the present
theory, is being done by Lynn Quam and Hans Moravec.  The machine
assembly task is being pursued both by the Artificial Intelligence
Group of the Stanford Research Institute and by the Hand Eye Project
at Stanford University.  Because the demand for doing practical
vision tasks can be satisfied with existing ad hoc methods or by not
using a visual sensor at all, little or no theoretical vision
progress will necessarily result from the achievement of spectacular
robotic industrial assembly demonstrations (hire the handicapped:
blind robots assemble widgets).  On the other hand, since the missing
ingredient for computer vision is the spatial modeling to which
perceived images can be related, I believe that the development of the
technology for generating commercial film and television by computer
for entertainment might make a significant contribution to computer
vision.
{L0,-400;H2;X0.6;*HORNY6;}